BOOSTING SKULL-STRIPPING PERFORMANCE FOR PEDIATRIC BRAIN IMAGES

Skull-stripping is the removal of background and non-brain anatomical features from brain images. While many skull-stripping tools exist, few target pediatric populations. With the emergence of multi-institutional pediatric data acquisition efforts to broaden the understanding of perinatal brain development, it is essential to develop robust and well-tested tools ready for the relevant data processing. However, the broad range of neuroanatomical variation in the developing brain, combined with additional challenges such as high motion levels, as well as shoulder and chest signal in the images, leaves many adult-specific tools ill-suited for pediatric skull-stripping. Building on an existing framework for robust and accurate skull-stripping, we propose developmental SynthStrip (d-SynthStrip), a skull-stripping model tailored to pediatric images. This framework exposes networks to highly variable images synthesized from label maps. Our model substantially outperforms pediatric baselines across scan types and age cohorts. In addition, the <1-minute runtime of our tool compares favorably to the fastest baselines. We distribute our model at https://w3id.org/synthstrip.


INTRODUCTION
Skull-stripping is the isolation of the brain from surrounding anatomical features, noise, and background signal in neuroimaging data, for example acquired with magnetic resonance imaging (MRI).It is an essential pre-processing step for many neuroimaging analysis pipelines, in which downstream image processing tasks frequently rely on input images with non-brain tissue removed [1][2][3].These pipelines automate labor-intensive processing steps and eliminate subjectivity, enabling researchers to focus on data interpretation and accelerating the pace of discovery in neuroscience.
As neuroanatomy differs substantially between infants and adults, methods developed for the latter are not generally well-suited for younger cohorts.For example, the brain undergoes rapid development during the first two years of life [4].During this time the brain doubles in size and the gray-white matter tissue MRI contrast flips (6-9 months).Additionally, pediatric scans are prone to motion artifacts and commonly include parts of the shoulders and chest.These challenges motivate the development of dedicated algorithms for skull-stripping in pediatric populations.Related Work.There are many existing skull-stripping methods developed for adult brain scans, which leverage a variety of strategies.Some methods iteratively fit deformable brain surfaces to the image [5], while others determine the brain boundary using a combination of generative and discriminative models, such as Random-Forest classifiers [6].More recently, deep-learning (DL) approaches train deep neural networks to segment the brain [7], often building on U-Net architectures [8].
Few skull-stripping algorithms are tailored specifically to pediatric populations.Typically, these more recent DL methods either target a single MRI contrast [9] or train a different network for each of the available contrasts [10].One approach uses separate two-dimensional (2D) networks for axial, coronal, and sagittal views extracted from the same input volume before fusing predictions via a voting scheme [9].An- other method trains a 3D U-Net to operate on overlapping 3D patches of the input volume [10].Synthesis Strategy.A recent learning strategy trains neural networks without acquired images, producing models that robustly generalize across datasets and imaging modalities [11,12].Synthesizing diverse training images from label maps, prior work achieves state-of-the-art performance on registration [13][14][15] and segmentation [16,17].SynthStrip [18] leverages this approach for robust skull-stripping.Despite demonstrated performance across a large variety of images including pediatric MRI, SynthStrip is an age-agnostic tool that does not specifically target this younger population.
Contribution.We demonstrate that optimizing SynthStrip for pediatric populations leads to performance gains, essential for downstream pediatric neuroimaging pipelines, and helps meet specific pipeline requirements, such as the exclusion of cerebrospinal fluid (CSF) from brain masks [2].We build on SynthStrip's generative model and architecture to address the challenges of pediatric neuroimaging data.We create a novel set of pediatric label maps for training-image synthesis and use it to train a new skull-stripping model, developmental SynthStrip (d-SynthStrip).We thoroughly analyze d-SynthStrip's performance on real MRI scans across MRI contrasts and pediatric age groups.We also investigate networkarchitecture variations to identify an optimal training configuration that surpasses state-of-the-art pediatric solutions in accuracy.Our baseline comparison focuses on publicly available and readily usable tools that can be run without retraining.We will freely distribute our model at w3id.org/synthstrip as a stand-alone tool and as part of the upcoming FreeSurfer and Infant FreeSurfer releases.

METHODS
We implement the supervised SynthStrip framework [18] for skull-stripping and tailor it to pediatric neuroimaging data.Let x be a 3D gray-scale image.A deep convolutional network (CNN) g θ with trainable parameters θ predicts the binary brain mask ŷ = g θ (x), such that a voxel-wise multiplication yields the skull-stripped image Instead of training with real images, the framework draws a pre-computed whole-head label map s at each optimization step and synthesizes head scan x with randomized intensity features from it.Each step updates parameters θ to minimize a loss L(y, ŷ) that encourages similarity between ŷ and the target brain mask y, derived from the brain labels in s.Training and Validation Data.We assemble a local dataset (MGH) from (i) 29 Infant FreeSurfer [2] training images (ii) 18 newborn scans [19,20] and (iii) the M-CRIB atlas cohort (N =10) [21].We select these 3 sources to cover a wide age range of 0-56 months (Table 1) and maximize variability across the included structural T1-weighted (T1w) and T2- We emphasize that we train d-SynthStrip with images synthesized from label maps rather than the label maps themselves.We create training label maps by combining manually drawn brain labels with an additional six labels across the non-brain image content, produced by fitting a Gaussian mixture model (GMM) [18] to the intensities of each image.The added labels have no neuroanatomical significance -we include them in training to synthesize more variable image content.For a balanced distribution of the GMM labels across the image, we apply non-uniformity correction to the image intensities prior to the GMM fit [1].For each image, we replace GMM-fitted labels that fall inside the brain boundary with the manual labels to produce a single label map.
Generative Model.At each training step, we sample s from the set of training label maps [18].First, we augment the spatial variability of s by applying the composition of a random affine (including translation, rotation, scaling, and shear) and nonlinear transform.Second, we sample a mean intensity value for each label and an overall variance.Then we sample intensity values for each voxel of the label from the corresponding normal distribution to generate gray-scale image x.Third, we apply an array of randomized corruptions including a spatially-varying intensity bias field, global intensity exponentiation, cropping, downsampling, and Gaussian blurring.These steps produce highly variable training data with complex intensity patterns across the image voxels of each label, including and also far exceeding the variability seen in medical images (Figure 2).
From the spatially augmented label map, we also derive ground-truth brain mask y.First, we merge all brain labels excluding non-ventricular CSF to form a binary map.Second, we fill and include the space between brain folds into the brain mask, via 10 iterations of dilation followed by 10 iterations of erosion using nearest-neighbor connectivity.Third, we fill any remaining 3D holes.The resulting brain mask y serves as the target for the network prediction in the loss function.
Architecture and Loss.We use the 3D SynthStrip U-Net [18] architecture.The U-Net g θ has seven resolution levels with two leaky-ReLU activated 3 × 3 × 3 convolutions per level.It outputs two softmax-activated channels j and k for brain and background, respectively.We optimize g θ using a Dice-based loss L Dice , which measures the overlap between the target brain mask y and the predicted mask ŷ: where we sum over all voxels v ∈ Ω of the spatial domain Ω of image x.In our experiments, we also analyze another model variant [18], which predicts a signed distance transform (SDT) d representing the distance to the brain boundary at each voxel.We optimize the mean squared error (MSE) from the target SDT d computed from y.To focus the optimization gradients on the brain boundary, we down-weight the MSE contribution of voxels farther than distance h from this boundary by a factor of b [18].Training Details.We use 50 label maps from the MGH dataset for synthesis-based training and the remaining 7 real MR images for validation.We train our d-SynthStrip models with stochastic gradient descent using Adam with a batch size of 1, until the loss on the validation set plateaus.We conform all images and label maps to 256 3 volume size with 1 mm 3 isotropic voxels and left-inferior-anterior orientation using linear interpolation.

EXPERIMENTS
To assess the skull-stripping performance of our models, we compare them against state-of-the-art baseline methods across MRI contrasts and age groups.
Test Data.We select 20 subjects from the UNC/UMN Baby Connectome Project (BCP) [22] and another 20 subjects from the Developing Human Connectome Project (dHCP) [23] to form a test cohort of N =40 subjects.For each subject, we source a T1w and T2w MR scan along with a label map which corresponds to both images (except 1 BCP subject, for which we have no T2w image).For the BCP cohort, we manually review and correct label maps generated with the Infant FreeSurfer pipeline [2].We obtain label maps for the dHCP cohort using the dHCP minimal processing pipeline [24].Table 1 displays the age distribution for each cohort.
Baselines.We compare our tool to well established skullstripping methods.First, we test SkullStripping CNN (SS-CNN) [9], which targets T1w pediatric MRI.Second, we test the skull-stripping module of the Infant Brain Extraction and Analysis Toolbox (iBEAT) [10] developed for T1w and T2w MRI (version 2.0, release 120).Third, we test SynthStrip [18] version 1.5, with the -no-csf flag in order to match the masks predicted by all other methods, which exclude nonventricular CSF.Finally, we test deepbet [25] version 0.0.2.
Although deepbet focuses on T1w adult MRI, we include it as another DL solution due to its demonstrated performance [25].As deepbet and SSCNN are tailored specifically to T1w MRI, we do not evaluate them on T2w images.
Metrics.We evaluate skull-stripping accuracy relative to binary ground truth masks using volumetric Dice overlap scores and Hausdorff distances between brain-mask boundaries.
Setup.First, we assess the brain-masking accuracy of each tool across MRI contrasts and age groups.Second, we analyze the two different architectures: we compare a traditional segmentation model with a Dice loss to SDT prediction with an unweighted (uSDT, b = 0 mm) and a weighted SDT loss (wSDT, b = 10 −3 , h = 4 mm) from Section 2.
Results. Figure 3 shows that d-SynthStrip trained with a Dice loss outperforms other skull-stripping methods regardless of contrast or subject cohort.Figure 4   In terms of Hausdorff distances, both our Dice and SDT models outperform all baselines tested, while the Dice model generally surpasses the SDT variants.SynthStrip closely follows SSCNN.While iBEAT struggles with the BCP data, it achieves the lowest Hausdorff distances among baseline methods for dHCP.
On an NVIDIA RTX 8000 GPU, d-SynthStrip, Synth-Strip, and deepbet take less than 1 minute per image, including model setup.However, d-SynthStrip inference alone takes less than 1 second.SSCNN takes approximately 15 minutes, while iBEAT requires up to 22 hours -skull-stripping results are not available before the full pipeline completes.

DISCUSSION
We present a pediatric brain extraction tool, d-SynthStrip, that outperforms specialized baseline skull-stripping methods on images acquired from newborns to toddlers.
While the synthesis strategy previously proved to produce networks that robustly generalize across patient populations, we demonstrate the benefit of synthesizing training data from label maps of a targeted population.d-SynthStrip outperforms SynthStrip by up to 10 Dice points and up to 20 mm Hausdorff distances on infant data.This difference in performance suggests that the synthetic scaling and deformations applied during synthesis may insufficiently cover the distribution of developing brain shapes.
While prior work shows similar skull-stripping accuracy between models trained with Dice and SDT losses [18], we find the Dice loss to lead to increased Dice scores at test time.This result is not surprising, and we plan to investigate receiver operating characteristic (ROC) curves in the future for a more comprehensive comparison of the two losses.
In addition, we will explore whether increasing the variability of the generative model, specifically the synthetic warps applied to input label maps, may bridge the performance gap to yield accurate masks across both pediatric and adult populations with a single model.We will also investigate whether a model trained with a dataset carefully balanced to cover the whole lifespan can robustly accommodate both pediatric and adult brain scans.

COMPLIANCE WITH ETHICAL STANDARDS
The retrospective analysis of the MGH data required no ethical approval.We signed data use agreements for access to the publicly available BCP and dHCP data.

Fig. 1 .
Fig. 1.SynthStrip-based training framework.Starting with manual brain label maps, we synthesize widely variable brain images and matching ground-truth brain masks, which we then use to train the model.
Figure 1 provides an overview of the learning framework, while Figure 2 shows training-image examples.

Fig. 2 .
Fig. 2. Synthetic training images generated from pediatric label maps.The spatial and intensity variability deliberately exceeds the range of medical images to encourage d-SynthStrip to generalize across MRI contrasts and age groups.

Table 1 .
Age distribution.The dHCP cohort includes preterm and term newborns, listed with gestational age (GA) at scan.
Fig.3.Brain extraction accuracy in terms of Hausdorff distance and volumetric Dice overlap.Testsets listed in Table1.